{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-Class Text Classification for make-up products using Doc2Vec \n",
    "\n",
    "Multi-Class Text Classification to make-up products based on their description and categories by using Doc2Vec vectors.\n",
    "\n",
    "Doc2vec is a method of vector representation of entire documents, not individual words. By document, you can mean a single sentence, paragraph, or an entire book. Doc2Vec architecture has two algorithms. One of the them is called Distributed Bag of Words (DBOW). The second algorithm is “distributed memory” (DM).\n",
    "\n",
    "In this project, we used Doc2Vec to get the document vectors and we used as input to the classification model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Importing packages and loading data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import re\n",
    "from time import time\n",
    "import nltk\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "from nltk.tokenize import sent_tokenize, word_tokenize\n",
    "\n",
    "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, f1_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_type</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lip_liner</td>\n",
       "      <td>Lippie Pencil A long-wearing and high-intensit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lipstick</td>\n",
       "      <td>Blotted Lip Sheer matte lipstick that creates ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lipstick</td>\n",
       "      <td>Lippie Stix Formula contains Vitamin E, Mango,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>foundation</td>\n",
       "      <td>Developed for the Selfie Age, our buildable fu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lipstick</td>\n",
       "      <td>All of our products are free from lead and hea...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  product_type                                        description\n",
       "0    lip_liner  Lippie Pencil A long-wearing and high-intensit...\n",
       "1     lipstick  Blotted Lip Sheer matte lipstick that creates ...\n",
       "2     lipstick  Lippie Stix Formula contains Vitamin E, Mango,...\n",
       "3   foundation  Developed for the Selfie Age, our buildable fu...\n",
       "4     lipstick  All of our products are free from lead and hea..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('C:\\\\Python Scripts\\\\API_products\\\\products_description.csv', header=0,index_col=0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(906, 2)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "94257\n"
     ]
    }
   ],
   "source": [
    "print(df['description'].apply(lambda x: len(x.split(' '))).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 94 257 words in the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text data clean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "product_type    0\n",
       "description     0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#checking missing values\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "#changing data type\n",
    "df['description'] = df['description'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#grouping data to smaller number of categories:\n",
    "df.loc[df['product_type'].isin(['lipstick','lip_liner']),'product_type'] = 'lipstic'\n",
    "df.loc[df['product_type'].isin(['blush','bronzer']),'product_type'] = 'contour'\n",
    "df.loc[df['product_type'].isin(['eyeliner','eyeshadow','mascara','eyebrow']),'product_type'] = 'eye_makeup'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "eye_makeup     367\n",
       "lipstic        176\n",
       "foundation     159\n",
       "contour        144\n",
       "nail_polish     60\n",
       "Name: product_type, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.product_type.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next step we remove non-alphabetic characters, the stopwords and lemmatizing for each line of text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cleanText(words):\n",
    "    \"\"\"The function to clean text\"\"\"\n",
    "    words = re.sub(\"[^a-zA-Z]\",\" \",words)\n",
    "    text = words.lower().split()\n",
    "    return \" \".join(text)\n",
    "\n",
    "df['description'] = df['description'].apply(cleanText)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop = stopwords.words('english')\n",
    "lem = WordNetLemmatizer()\n",
    "\n",
    "\n",
    "def remove_stopwords(text):\n",
    "    \"\"\"The function to removing stopwords\"\"\"\n",
    "    text = [word.lower() for word in text.split() if word.lower() not in stop]\n",
    "    return \" \".join(text)\n",
    "\n",
    "def word_lem(text):\n",
    "    \"\"\"The function to apply lemmatizing\"\"\"\n",
    "    lem_text = [lem.lemmatize(word) for word in text.split()]\n",
    "    return \" \".join(lem_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['description'] = df['description'].apply(remove_stopwords)\n",
    "df['description'] = df['description'].apply(word_lem)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_type</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lipstic</td>\n",
       "      <td>lippie pencil long wearing high intensity lip ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>lipstic</td>\n",
       "      <td>blotted lip sheer matte lipstick creates perfe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>lipstic</td>\n",
       "      <td>lippie stix formula contains vitamin e mango a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>foundation</td>\n",
       "      <td>developed selfie age buildable full coverage n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>lipstic</td>\n",
       "      <td>product free lead heavy metal parabens phthala...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  product_type                                        description\n",
       "0      lipstic  lippie pencil long wearing high intensity lip ...\n",
       "1      lipstic  blotted lip sheer matte lipstick creates perfe...\n",
       "2      lipstic  lippie stix formula contains vitamin e mango a...\n",
       "3   foundation  developed selfie age buildable full coverage n...\n",
       "4      lipstic  product free lead heavy metal parabens phthala..."
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['description'] = df['description'].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "60892\n"
     ]
    }
   ],
   "source": [
    "print(df['description'].apply(lambda x: len(x.split(' '))).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After text cleaning and removing stop words, we have only 60 892 words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "#save clean data\n",
    "df.to_csv('C:\\Python Scripts\\API_products\\products_clean.csv', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we split the text into training and testing sets. Then we create a **TaggedDocument** object because this is what Doc2Vec wants as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test = train_test_split(df, test_size=0.3, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_tag = train.apply(lambda x: TaggedDocument(words=word_tokenize(x['description']), tags=[x.product_type]), axis=1)\n",
    "\n",
    "test_tag = test.apply(lambda x: TaggedDocument(words=word_tokenize(x['description']), tags=[x.product_type]), axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TaggedDocument(words=['pressed', 'foundation', 'marienatie', 'providing', 'silky', 'flawless', 'finish', 'provides', 'great', 'coverage', 'protects', 'skin', 'spf', 'titanium', 'dioxide', 'act', 'absorbent', 'oil', 'jojoba', 'oil', 'help', 'cleanse', 'moisturize', 'skin'], tags=['foundation'])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_tag[20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TaggedDocument(words=['let', 'eye', 'naturally', 'pop', 'b', 'smudged', 'subtle', 'eye', 'color', 'add', 'tint', 'color', 'base', 'lash', 'organic', 'cream', 'eye', 'color', 'b', 'smudged', 'eliminates', 'inevitable', 'uneven', 'line', 'traditional', 'eyeliner', 'require', 'expert', 'blending', 'technique', 'messy', 'powder', 'based', 'shadow', 'simply', 'smudge', 'along', 'lash', 'line', 'color', 'stay', 'place', 'long', 'lasting', 'look'], tags=['eye_makeup'])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_tag[10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting the feature vector from doc2vec model\n",
    "\n",
    "In this step we initialize the gensim doc2vec model. Doc2Vec architecture is also similar to word2vec and has two algorithms like word2vec and they are the corresponding algorithms for those two algorithms. One of the them is called Distributed Bag of Words (DBOW) which is similar to “Skip-gram” (SG) model in word2vec except that additional paragraph id vector is added. The second algorithm is “distributed memory” (DM) which is similar to “Continuous bag of words” CBOW in word vector.\n",
    "\n",
    "#### Distributed Bag of Words (DBOW)\n",
    "\n",
    "First, we instantiate a doc2vec model — Distributed Bag of Words (DBOW). We set the following parameters: \n",
    "\n",
    "- dm=0 , distributed bag of words (DBOW) is used;\n",
    "- vector_size = 100, word embeddings will have shape of;\n",
    "- window = 2, model will try to predict every second word;\n",
    "- sample=0 , the threshold for configuring which higher-frequency words are randomly down sampled;\n",
    "- min_count=2, ignores all words with total frequency lower than this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\gensim\\models\\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.\n",
      "  \"C extension not loaded, training will be slow. \"\n"
     ]
    }
   ],
   "source": [
    "doc_model = Doc2Vec(dm=0, vector_size=100, min_count=2, window=2, sample = 0)\n",
    "               \n",
    "doc_model.build_vocab(train_tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "43497"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_model.corpus_total_words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training for 30 epochs: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 10min 32s\n"
     ]
    }
   ],
   "source": [
    "%time doc_model.train(train_tag, total_examples=doc_model.corpus_count, epochs=30) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('beneficial', 0.3377334177494049),\n",
       " ('entire', 0.30722713470458984),\n",
       " ('smoother', 0.2944685220718384),\n",
       " ('fix', 0.2937236726284027),\n",
       " ('intensify', 0.2861747443675995),\n",
       " ('b', 0.2843270003795624),\n",
       " ('page', 0.2822366952896118),\n",
       " ('withstand', 0.279226154088974),\n",
       " ('terracotta', 0.2777043879032135),\n",
       " ('isopropylparaben', 0.2721264958381653)]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_model.most_similar('blush')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "#save model\n",
    "doc_model.save('model.doc2vec')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building the Final Vector Feature for the Classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "def vector_for_learning(model, input_docs):\n",
    "    sents = input_docs\n",
    "    targets, feature_vectors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])\n",
    "    return targets, feature_vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train, X_train = vector_for_learning(doc_model, train_tag)\n",
    "y_test, X_test = vector_for_learning(doc_model, test_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the Classifier\n",
    "\n",
    "We choose Logistic Regression Classifier and Linear Support Vector Machine.\n",
    "\n",
    "***Logistic Regression with DBOW***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.\n",
      "  \"this warning.\", FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "log_reg = LogisticRegression(n_jobs=1, C=5)\n",
    "log_reg.fit(X_train, y_train)\n",
    "y_pred = log_reg.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Testing the Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Testing accuracy 0.9154411764705882\n",
      "Testing F1 score: 0.9149097862036379\n"
     ]
    }
   ],
   "source": [
    "print('Testing accuracy %s' % accuracy_score(y_pred, y_test))\n",
    "print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "     contour       0.84      0.90      0.87        40\n",
      "  eye_makeup       0.90      0.98      0.94       115\n",
      "  foundation       0.98      0.87      0.92        53\n",
      "     lipstic       0.93      0.82      0.87        51\n",
      " nail_polish       1.00      0.92      0.96        13\n",
      "\n",
      "    accuracy                           0.92       272\n",
      "   macro avg       0.93      0.90      0.91       272\n",
      "weighted avg       0.92      0.92      0.91       272\n",
      "\n"
     ]
    }
   ],
   "source": [
    "ytest = np.array(y_test)\n",
    "print(classification_report(ytest, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Linear Support Vector Machine with DBOW***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n",
       "          intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n",
       "          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
       "          verbose=0)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm = LinearSVC()\n",
    "svm.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Testing accuracy 0.9154411764705882\n",
      "Testing F1 score: 0.9154729253555255\n"
     ]
    }
   ],
   "source": [
    "preds = svm.predict(X_test)\n",
    "print('Testing accuracy %s' % accuracy_score(preds, y_test))\n",
    "print('Testing F1 score: {}'.format(f1_score(y_test, preds, average='weighted')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "     contour       0.82      0.90      0.86        40\n",
      "  eye_makeup       0.91      0.97      0.94       115\n",
      "  foundation       1.00      0.87      0.93        53\n",
      "     lipstic       0.91      0.84      0.88        51\n",
      " nail_polish       1.00      0.92      0.96        13\n",
      "\n",
      "    accuracy                           0.92       272\n",
      "   macro avg       0.93      0.90      0.91       272\n",
      "weighted avg       0.92      0.92      0.92       272\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(ytest, preds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Distributed Memory (DM)\n",
    "\n",
    "Now, we instantiate a Distributed Memory (DM) with a vector size with 100 words and iterating over the training corpus 30 times.\n",
    "\n",
    "Distributed Memory (DM) works like a memory that remembers what is missing from the current context or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\gensim\\models\\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.\n",
      "  \"C extension not loaded, training will be slow. \"\n"
     ]
    }
   ],
   "source": [
    "dm_model = Doc2Vec(dm=1, vector_size=100, min_count=2, window=2, sample = 0, negative=5, alpha=0.025, min_alpha=0.001)\n",
    "dm_model.build_vocab(train_tag)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "43497"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dm_model.corpus_total_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 16min 19s\n"
     ]
    }
   ],
   "source": [
    "%time dm_model.train(train_tag, total_examples=dm_model.corpus_count, epochs=30) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "dm_model.save('dm_model.doc2vec')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extract training vectors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train_dm, X_train_dm = vector_for_learning(dm_model, train_tag)\n",
    "y_test_dm, X_test_dm = vector_for_learning(dm_model, test_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training and testing Classifiers:\n",
    "\n",
    "***Logistic Regression with DM***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n",
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.\n",
      "  \"this warning.\", FutureWarning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Testing accuracy 0.9007352941176471\n",
      "Testing F1 score: 0.9005217056526574\n"
     ]
    }
   ],
   "source": [
    "log_reg = LogisticRegression(n_jobs=1, C=5)\n",
    "log_reg.fit(X_train_dm, y_train_dm)\n",
    "pred = log_reg.predict(X_test_dm)\n",
    "\n",
    "print('Testing accuracy %s' % accuracy_score(y_test_dm, pred))\n",
    "print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred, average='weighted')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "     contour       0.78      0.90      0.84        40\n",
      "  eye_makeup       0.93      0.97      0.95       115\n",
      "  foundation       0.87      0.87      0.87        53\n",
      "     lipstic       0.95      0.78      0.86        51\n",
      " nail_polish       1.00      0.92      0.96        13\n",
      "\n",
      "    accuracy                           0.90       272\n",
      "   macro avg       0.91      0.89      0.89       272\n",
      "weighted avg       0.90      0.90      0.90       272\n",
      "\n"
     ]
    }
   ],
   "source": [
    "ytest = np.array(y_test_dm)\n",
    "print(classification_report(ytest, pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Linear Support Vector Machine with DM***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\PC\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.\n",
      "  \"the number of iterations.\", ConvergenceWarning)\n"
     ]
    }
   ],
   "source": [
    "svm = LinearSVC()\n",
    "svm.fit(X_train_dm, y_train_dm)\n",
    "pred_y = svm.predict(X_test_dm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Testing accuracy 0.9007352941176471\n",
      "Testing F1 score: 0.9004546938043879\n"
     ]
    }
   ],
   "source": [
    "print('Testing accuracy %s' % accuracy_score(pred_y, y_test_dm))\n",
    "print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred_y, average='weighted')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "     contour       0.80      0.88      0.83        40\n",
      "  eye_makeup       0.95      0.97      0.96       115\n",
      "  foundation       0.84      0.87      0.85        53\n",
      "     lipstic       0.93      0.78      0.85        51\n",
      " nail_polish       1.00      0.92      0.96        13\n",
      "\n",
      "    accuracy                           0.90       272\n",
      "   macro avg       0.90      0.88      0.89       272\n",
      "weighted avg       0.90      0.90      0.90       272\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(ytest, pred_y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From received results we can see the obtained accuracy of DBOW model is better than DM model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this project, we used training set to train doc2vec classifier for our make-up products description. We have trained doc2vec ( DBOW and DM) model and decided to use Logistic Regression and Linear Support Vector Machine models, which received the best results in previous testing model (combination Bag of Words with TF-IDF). We were able to achieve 92% accuracy for SVM and Logistic Regression with DBOW method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
